Movie Rating Model and Predictor
Part 5: Modeling
The aim at this stage was to develop two prediction models. Model One, a simple linear regression model, identified which of the quantitative variables was the best predictor of the log of daily box office revenue. That best predictor was designated the response variable for Model Two: a multiregression model that selected the best predictors for the designated response variable. Model Two was the best performing of four multiregression models, developed using both forward selection and backward elimination. These four models and their model selection methods were:
Table 1: Multiregression prediction models

| Model | Model Selection | Data |
|---|---|---|
| Alpha | Forward Selection | Full model |
| Beta | Forward Selection | Full model, influential outliers removed |
| Gamma | Backward Elimination | Full model |
| Delta | Backward Elimination | Full model, influential outliers removed |
The remainder of this section is organized as follows.
Model One: Simple Linear Regression Model
1.1. Model Selection
1.2. Model Diagnostics
1.3. Model Interpretation
Model Two: Multiregression Model
2.1. Model Selection Methods
2.2. Full Model
2.3. Model Alpha
2.4. Model Beta
2.5. Model Gamma
2.6. Model Delta
2.7. Model Comparison
2.8. Model Two: Final Multiregression Model
Model Summary
Model One: Simple Linear Regression
Model Selection
Several simple linear models were fit to determine which of the quantitative variables in Table 3 was the best predictor of the log of daily box office revenue.
Table 3: Simple linear regression variables

| Variable | Description |
|---|---|
| audience_score | Audience score on Rotten Tomatoes |
| cast_experience | The sum across all cast members for a film, of the number of films in which each actor appeared |
| cast_experience_log | Log of the sum across all cast members for a film, of the number of films in which each actor appeared |
| cast_scores | Total number of allocated audience and IMDB scores per day for the cast of a film |
| cast_scores_log | Log of cast_scores |
| cast_votes | Total number of allocated IMDB votes per day for the cast of a film |
| cast_votes_log | Log of cast_votes |
| critics_score | Critics score on Rotten Tomatoes |
| director_experience | Total number of films in sample for a director |
| director_experience_log | Log of the total number of films directed by the film’s director |
| imdb_num_votes | Number of votes on IMDB |
| imdb_num_votes_log | Log number of IMDB votes |
| imdb_rating | Rating on IMDB |
| runtime | Runtime of movie (in minutes) |
| runtime_log | Log runtime of movie (in minutes) |
| votes_per_day | The number of IMDB Votes / thtr_days |
| votes_per_day_log | Log of votes_per_day |
As suggested by the correlation analysis in Table 4 and summarized in Table 5, the log number of IMDB votes was the best predictor of the log of daily box office revenue (F(1, 177) = 332.237, p < .001), with an adjusted R-squared of 0.65. The model accounted for 65% of the variance in the response.
Table 5: Best performing simple linear regression on log of box office revenue

| Term | Df | Sum Sq | Mean Sq | F Statistic | Pr(>F) | % Var |
|---|---|---|---|---|---|---|
| imdb_num_votes_log | 1 | 1355.05 | 1355.05 | 332.24 | 0 | 65.24 |
| Residuals | 177 | 721.90 | 4.08 | NA | NA | 34.76 |
Model Diagnostics
Linearity
The linearity of the predictor with the log of daily box office is illustrated in Figure 1.
Figure 1 Model One linearity plot
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(2), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 2) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 2 Model One homoscedasticity plot
The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.198). As such, the homoscedasticity assumption was met in this case.
Residuals
The histogram and the normal Q-Q plot in Figure 3 illustrate the distribution of residuals.
Figure 3 Model One residuals plot
The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. The Shapiro-Wilk test was not significant (SW = 0.996, p = 0.885), the skewness (-0.016) was close to zero, and the kurtosis (3.147) was close to 3. Together these indicated that normality of residuals was a reasonable assumption for this model.
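The skewness and kurtosis figures quoted for the residuals can be illustrated with a short sketch. It assumes the moment-based definitions, with kurtosis reported on the non-excess scale (about 3 for normal data), which is consistent with the values reported here but is not confirmed by the report. The `residuals` list is made-up data, and the Shapiro-Wilk test is omitted because it is not available in the Python standard library.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Moment-based, non-excess kurtosis: m4 / m2^2 (about 3 for normal data)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical residuals, for illustration only (not the report's data).
residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(skewness(residuals), kurtosis(residuals))
```

Values of skewness near 0 and kurtosis near 3 are what one would expect from approximately normal residuals.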
Outliers
Figure 4 Model One Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 10 cases exerting undue influence on the model. A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points were not removed from the model.
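As a rough illustration of the case-wise diagnostic mentioned above, the sketch below computes Cook's distance for a simple linear regression using only the standard library. The data are made up, and the cutoff of 4/n is a common rule of thumb assumed here for illustration; the report does not state which cutoff was used.

```python
def cooks_distances(x, y):
    """Cook's distance for each case of a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2  # estimated coefficients: intercept and slope
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each case in a one-predictor model.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]


# Hypothetical data: the last case is far from the others in x and off the trend.
x = [1, 2, 3, 4, 5, 10]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]
distances = cooks_distances(x, y)
flagged = [i for i, d in enumerate(distances) if d > 4 / len(x)]  # assumed cutoff
print(flagged)
```

A flagged case is only a candidate for review; as in the text, whether to drop it depends on a case-wise check for data quality problems.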
Model Interpretation
The final prediction equation was defined as follows:
\(y_i = -7.29 + 1.24x_1 + \epsilon\)
where:
\(x_1\) is imdb_num_votes_log
Analysis of Variance
Figure 5 summarizes the analysis of variance.

| Term | Df | Sum Sq | Mean Sq | F Statistic | Pr(>F) | % Var |
|---|---|---|---|---|---|---|
| imdb_num_votes_log | 1 | 1355.045 | 1355.045 | 332.237 | 0 | 65.24 |
| Residuals | 177 | 721.903 | 4.079 | NA | NA | 34.76 |
Figure 5 Model One analysis of variance
An analysis of variance was conducted on the influence of the single independent variable on the log daily box office. The effect of imdb_num_votes_log on the log daily box office was significant (F(1, 177) = 332.237, p < .001), accounting for 65.24% of the variance. Residuals accounted for the remaining 34.76% of the variance. The model was significant (F(1, 177) = 332.237, p < .001), with an adjusted R-squared of 0.65.
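The quantities in the analysis of variance follow directly from the sums of squares and degrees of freedom in the table. The sketch below reproduces them as a check. The variable names are illustrative, and the sample size is inferred from the degrees of freedom rather than stated in the report.

```python
# Values taken from the Model One analysis of variance table.
ss_model, df_model = 1355.045, 1
ss_resid, df_resid = 721.903, 177

ms_model = ss_model / df_model          # mean square for the predictor
ms_resid = ss_resid / df_resid          # mean square error
f_stat = ms_model / ms_resid            # F statistic on (1, 177) df

r_squared = ss_model / (ss_model + ss_resid)
n = df_model + df_resid + 1             # inferred sample size
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_resid

print(round(f_stat, 3), round(100 * r_squared, 2), round(adj_r_squared, 2))
```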
Interpretation of Coefficients
The intercept, -7.29, is the predicted log daily box office revenue for a film whose log number of IMDB votes is zero. The slope indicates that each one-unit increase in the log number of IMDB votes is associated with an increase of 1.24 in the predicted log daily box office revenue (in log dollars).
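A minimal sketch of how the fitted equation would be applied to a raw vote count is given below. It assumes the logs are natural logarithms, which the report does not state explicitly, and the function name is hypothetical.

```python
import math

INTERCEPT = -7.29  # from the Model One prediction equation
SLOPE = 1.24       # coefficient on imdb_num_votes_log


def predict_log_revenue(imdb_num_votes):
    """Predicted log daily box office revenue (natural log assumed)."""
    return INTERCEPT + SLOPE * math.log(imdb_num_votes)


# A film with a single vote has log votes of zero, so the prediction is the intercept.
print(predict_log_revenue(1))
# Multiplying votes by e raises log votes by one unit, so the prediction rises by the slope.
print(predict_log_revenue(math.e * 5000) - predict_log_revenue(5000))
```

Because both variables are logged, the slope can also be read as an elasticity: a 1% increase in IMDB votes corresponds to roughly a 1.24% increase in predicted daily box office revenue.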
Model Two: Multiple Linear Regression
Model Two was the best performing of models Alpha, Beta, Gamma, and Delta. The following provides an overview of the model selection methods used, then each model is described and diagnosed vis-a-vis assumptions of linearity, homoscedasticity, normality of errors, multicollinearity, and the treatment of influential points.
Model Selection Methods
Both forward selection and backward elimination were used as model selection techniques. The forward selection approach optimized adjusted R-squared, whereas the backward elimination method was based upon p-values.
Forward Selection
The forward selection process began with a null model. Each candidate variable was then added to the model, one at a time, and the variable that provided the greatest improvement over the current best adjusted R-squared was selected. The process repeated with the variables not already in the model until all variables had been analyzed. Only additions that improved adjusted R-squared were retained at each step.
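The procedure can be sketched in a few lines. The following is an illustrative standard-library implementation for numeric predictors only, not the code used for the analysis. The helper names and the toy data are hypothetical, and the toy columns were chosen to be mutually orthogonal so the result is easy to verify by hand.

```python
def fit_adj_r2(columns, y):
    """Adjusted R-squared of an OLS fit of y on an intercept plus the given columns."""
    n, k = len(y), len(columns)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = k + 1
    # Normal equations (X'X) b = X'y as an augmented matrix.
    A = [
        [sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
        + [sum(X[r][i] * y[r] for r in range(n))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))


def forward_select(candidates, y):
    """Add one variable per step while adjusted R-squared keeps improving."""
    selected, best = [], 0.0  # the null (intercept-only) model has adjusted R-squared 0
    while True:
        scores = {
            name: fit_adj_r2([candidates[s] for s in selected + [name]], y)
            for name in candidates
            if name not in selected
        }
        if not scores:
            break
        name = max(scores, key=scores.get)
        if scores[name] <= best:
            break  # no remaining variable improves the fit
        selected.append(name)
        best = scores[name]
    return selected, best


# Toy data: y depends on x1 and x2; x3 is unrelated to y.
candidates = {
    "x1": [1, 1, 1, 1, -1, -1, -1, -1],
    "x2": [1, 1, -1, -1, 1, 1, -1, -1],
    "x3": [1, -1, 1, -1, 1, -1, 1, -1],
}
noise = [0.5, -0.5, -0.5, 0.5, -0.5, 0.5, 0.5, -0.5]
y = [3 * a + b + e for a, b, e in zip(candidates["x1"], candidates["x2"], noise)]
print(forward_select(candidates, y))
```

On the toy data the procedure selects `x1`, then `x2`, and stops because adding `x3` lowers adjusted R-squared. Categorical predictors such as genre would need to be expanded into dummy columns and added as a group, which this sketch does not handle.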
Backward Elimination
The backward elimination approach began with the full model. A regression analysis was performed and the least significant predictor (that with the highest p-value) was removed from the model. This process repeated, removing only the least significant predictor at each step, until all predictors had p-values below the preset threshold.
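The elimination loop itself is simple once p-values are available. The sketch below separates the loop from the model fitting: `p_values_for` is a hypothetical callback that would refit the regression on the current predictors and return their p-values. The default threshold of 0.05 is an assumption, since the report does not state the preset value, and the stub p-values are invented for illustration.

```python
def backward_eliminate(predictors, p_values_for, threshold=0.05):
    """Drop the least significant predictor until all p-values are below threshold."""
    current = list(predictors)
    removed = []
    while current:
        p_values = p_values_for(current)  # refit on the current predictor set
        worst = max(current, key=lambda name: p_values[name])
        if p_values[worst] < threshold:
            break  # every remaining predictor is significant
        current.remove(worst)
        removed.append((worst, p_values[worst]))
    return current, removed


# Stub standing in for a real refit: p-values here do not change between steps,
# whereas in a real analysis they would be recomputed after each removal.
STUB_P_VALUES = {"x1": 0.001, "x2": 0.40, "x3": 0.03, "x4": 0.75}


def stub_p_values(current):
    return {name: STUB_P_VALUES[name] for name in current}


kept, dropped = backward_eliminate(["x1", "x2", "x3", "x4"], stub_p_values)
print(kept, dropped)
```

The `removed` list mirrors the step-by-step elimination tables used in this section: one row per step, with the variable dropped and its p-value at that step.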
Full Model Selection
Since the objective of the analysis was to determine what factors make a movie popular, the full model did not include variables that could be considered proxies of popularity such as audience rating or IMDB rating. Such ratings are measures of a film’s popularity, not predictors. Critics rating, on the other hand, was considered not a measure, but a potential leading indicator of movie popularity. Similarly, effort was made to capture the popularity of specific cast members to test the hypothesis that a cast’s aggregate popularity could influence the popularity of a film. That said, the criteria for excluding a variable from the full model were as follows:
* Measures of film popularity such as the audience rating, IMDB rating and top 200 box office variables
* Categorical variables with levels including fewer than 5 observations, such as title, url, studio, and the actor variables
* The year and day of theatrical or DVD release
| Type | Variable | Description |
|---|---|---|
| Categorical | best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie |
| Categorical | best_actress_win | Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie |
| Categorical | best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie |
| Categorical | best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) |
| Categorical | best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) |
| Categorical | genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| Categorical | mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| Categorical | thtr_rel_month | Month the movie is released in theaters |
| Numeric | cast_scores | Total number of allocated audience and IMDB scores per day for the cast of a film |
| Numeric | cast_votes_log | Log of cast_votes |
| Numeric | critics_score | Critics score on Rotten Tomatoes |
| Numeric | director_experience_log | Log of the total number of films directed by the film’s director |
| Numeric | runtime_log | Log runtime of movie (in minutes) |
| Numeric | votes_per_day_log | Log of votes_per_day |
The following sections explore various models, model selection techniques, and model diagnostics. Comparisons are conducted and the models are evaluated on test data for prediction accuracy and stability. Lastly, the best performing model is selected and described in detail.
Model Alpha
For this model, a forward selection procedure was undertaken based upon the full model described above. Table 7 lists the variables in the order in which they were added.
Table 7: Model Alpha forward selection process

| Step | Selected | Model Size | DF | F Statistic | R-Squared | Adj R-Squared | p-value | Pct Chg |
|---|---|---|---|---|---|---|---|---|
| 1 | cast_scores | 1 | 2 481 | 110.31 | 0.19 | 0.18 | 0 | 0.00 |
| 2 | genre | 2 | 12 471 | 20.52 | 0.32 | 0.31 | 0 | 66.49 |
| 3 | critics_score | 3 | 13 470 | 21.46 | 0.35 | 0.34 | 0 | 9.74 |
| 4 | mpaa_rating | 4 | 17 466 | 17.73 | 0.38 | 0.36 | 0 | 5.62 |
| 5 | cast_votes | 5 | 18 465 | 17.38 | 0.39 | 0.37 | 0 | 2.52 |
| 6 | best_pic_nom | 6 | 19 464 | 16.98 | 0.40 | 0.37 | 0 | 2.19 |
| 7 | runtime_log | 7 | 20 463 | 16.42 | 0.40 | 0.38 | 0 | 1.07 |
| 8 | best_dir_win | 8 | 21 462 | 15.76 | 0.41 | 0.38 | 0 | 0.53 |
As indicated in Table 8 and graphically depicted in Figure 6, the model was significant (F(21, 462) = 15.759, p < .001), with an adjusted R-squared of 0.38.
Table 8: Model Alpha Summary Statistics

| Model | Size | df | df Residuals | F Statistic | RMSE | Residual SE | R-Squared | Adj R-Squared | p-value | % Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Alpha | 8 | 21 | 462 | 15.759 | 1.85 | 1.85 | 0.406 | 0.38 | 0 | 40.554 |
Figure 6 Model Alpha Regression
Model Diagnostics
Linearity
The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 7.
Figure 7 Model Alpha linearity plots
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(21), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 8) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 8 Model Alpha homoscedasticity plot
The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.253). As such, the homoscedasticity assumption was met in this case.
Residuals
The histogram and the normal Q-Q plot in Figure 9 illustrate the distribution of residuals.
Figure 9 Model Alpha residuals plot
The histogram and normal Q-Q plot did not suggest a normal distribution of residuals. A review of the Shapiro-Wilk test (SW = 0.992, p = 0.008) and the skewness (0.147) and kurtosis (2.444) indicated that normality of residuals was not a reasonable assumption for this model.
Multicollinearity
As shown in Figure 10 and Table 9, collinearity did not appear to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF of 3 did not exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.
Figure 10: Model Alpha correlations among quantitative predictors
| Variable | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| cast_scores | 2.545 | 1 | 1.595 |
| genre | 3.024 | 10 | 1.057 |
| critics_score | 1.427 | 1 | 1.195 |
| mpaa_rating | 2.516 | 4 | 1.122 |
| cast_votes | 2.281 | 1 | 1.510 |
| best_pic_nom | 1.136 | 1 | 1.066 |
| runtime_log | 1.429 | 1 | 1.195 |
| best_dir_win | 1.100 | 1 | 1.049 |
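The last column of the table above rescales each generalized VIF by its degrees of freedom, which makes multi-level factors such as genre comparable with single-coefficient predictors. A short check of that arithmetic on three rows of the table follows; the function name is illustrative.

```python
def adjusted_gvif(gvif, df):
    """Degrees-of-freedom adjusted generalized VIF: GVIF^(1/(2*Df))."""
    return gvif ** (1 / (2 * df))


# (GVIF, Df) pairs taken from the Model Alpha table.
rows = {
    "cast_scores": (2.545, 1),
    "genre": (3.024, 10),
    "mpaa_rating": (2.516, 4),
}
for name, (gvif, df) in rows.items():
    print(name, round(adjusted_gvif(gvif, df), 3))
```

For a predictor with one degree of freedom the adjusted value is simply the square root of the VIF, so squaring the last column puts all predictors on the familiar VIF scale.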
Outliers
Figure 11 Model Alpha Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 23 cases exerting undue influence on the model. To discern the effect of these outliers on the model, a new model (Model Beta) was created with the outliers removed.
Model Beta
This was also a forward selection model; however, it was based upon the full model with outliers from Model Alpha removed. The variables were added as described in Table 10.
Table 10: Model Beta forward selection process

| Step | Selected | Model Size | DF | F Statistic | R-Squared | Adj R-Squared | p-value | Pct Chg |
|---|---|---|---|---|---|---|---|---|
| 1 | cast_scores | 1 | 2 458 | 116.97 | 0.20 | 0.20 | 0 | 0.00 |
| 2 | genre | 2 | 12 448 | 21.80 | 0.35 | 0.33 | 0 | 64.85 |
| 3 | critics_score | 3 | 13 447 | 22.64 | 0.38 | 0.36 | 0 | 8.41 |
| 4 | cast_votes | 4 | 14 446 | 22.75 | 0.40 | 0.38 | 0 | 5.54 |
| 5 | mpaa_rating | 5 | 18 442 | 18.65 | 0.42 | 0.40 | 0 | 3.67 |
| 6 | best_pic_nom | 6 | 19 441 | 18.31 | 0.43 | 0.40 | 0 | 2.28 |
| 7 | runtime_log | 7 | 20 440 | 17.74 | 0.43 | 0.41 | 0 | 1.24 |
| 8 | best_pic_win | 8 | 21 439 | 17.01 | 0.44 | 0.41 | 0 | 0.49 |
As indicated in Table 11 and graphically depicted in Figure 12, the model was significant (F(21, 439) = 17.013, p < .001), with an adjusted R-squared of 0.411.
Table 11: Model Beta Summary Statistics

| Model | Size | df | df Residuals | F Statistic | RMSE | Residual SE | R-Squared | Adj R-Squared | p-value | % Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Beta | 8 | 21 | 439 | 17.013 | 1.759 | 1.759 | 0.437 | 0.411 | 0 | 43.664 |
Figure 12 Model Beta Regression
Model Diagnostics
Linearity
The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 13.
Figure 13 Model Beta linearity plots
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(21), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 14) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 14 Model Beta homoscedasticity plot
The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.658). As such, the homoscedasticity assumption was met in this case.
Residuals
The histogram and the normal Q-Q plot in Figure 15 illustrate the distribution of residuals.
Figure 15 Model Beta residuals plot
The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. A review of the Shapiro-Wilk test (SW = 0.995, p = 0.131) and the skewness (0.127) and kurtosis (2.604) supported the assumption of normality.
Multicollinearity
As shown in Figure 16 and Table 12, collinearity did not appear to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF of 3.3 did not exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.
Figure 16: Correlations among quantitative predictors
| Variable | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| cast_scores | 2.569 | 1 | 1.603 |
| genre | 3.308 | 10 | 1.062 |
| critics_score | 1.412 | 1 | 1.188 |
| cast_votes | 2.306 | 1 | 1.518 |
| mpaa_rating | 2.448 | 4 | 1.118 |
| best_pic_nom | 1.320 | 1 | 1.149 |
| runtime_log | 1.519 | 1 | 1.233 |
| best_pic_win | 1.230 | 1 | 1.109 |
Outliers
Figure 17 Model Beta Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 24 cases exerting undue influence on the model. A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points would not be removed from the model.
Model Gamma
For this model, a backward elimination procedure was undertaken based upon the full model. The variables were removed as described in Table 13.
Table 13: Model Gamma

| Steps | Removed | p-value |
|---|---|---|
| 1 | director_experience | 0.93 |
| 2 | cast_scores_log | 0.93 |
| 3 | director_experience_log | 0.69 |
| 4 | cast_votes_log | 0.66 |
| 5 | best_actress_win | 0.34 |
The model therefore retained the following variables:
Table 14 Model Gamma Variables

| Variable | Description |
|---|---|
| genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| runtime_log | Log runtime of movie (in minutes) |
| mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| thtr_rel_month | Month the movie is released in theaters |
| critics_score | Critics score on Rotten Tomatoes |
| best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) |
| best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) |
| best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie |
| best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie |
| cast_votes | Total number of allocated IMDB votes per day for the cast of a film |
| cast_scores | Total number of allocated audience and IMDB scores per day for the cast of a film |
As indicated in Table 15 and graphically depicted in Figure 18, the model was significant (F(34, 449) = 9.866, p < .001), with an adjusted R-squared of 0.378.
Table 15 Model Gamma Summary Statistics

| Model | Size | df | df Residuals | F Statistic | RMSE | Residual SE | R-Squared | Adj R-Squared | p-value | % Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Gamma | 11 | 34 | 449 | 9.866 | 1.853 | 1.853 | 0.42 | 0.378 | 0 | 42.032 |
Figure 18 Model Gamma Regression
Model Diagnostics
Linearity
The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 19.
Figure 19 Model Gamma linearity plots
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(34), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 20) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 20 Model Gamma homoscedasticity plot
The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.29). As such, the homoscedasticity assumption was met in this case.
Residuals
The histogram and the normal Q-Q plot in Figure 21 illustrate the distribution of residuals.
Figure 21 Model Gamma residuals plot
The histogram and normal Q-Q plot did not suggest a normal distribution of residuals. A review of the Shapiro-Wilk test (SW = 0.99, p = 0.003) and the skewness (0.149) and kurtosis (2.397) indicated that normality of residuals was not a reasonable assumption for this model.
Multicollinearity
As shown in Figure 22 and Table 16, collinearity did not appear to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF of 3.9 did not exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.
Figure 22: Correlations among quantitative predictors
| Variable | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| genre | 3.854 | 10 | 1.070 |
| mpaa_rating | 2.760 | 4 | 1.135 |
| thtr_rel_month | 1.752 | 11 | 1.026 |
| best_pic_nom | 1.445 | 1 | 1.202 |
| best_pic_win | 1.365 | 1 | 1.168 |
| best_actor_win | 1.297 | 1 | 1.139 |
| best_dir_win | 1.212 | 1 | 1.101 |
| critics_score | 1.457 | 1 | 1.207 |
| runtime_log | 1.551 | 1 | 1.246 |
| cast_scores | 2.614 | 1 | 1.617 |
| cast_votes | 2.429 | 1 | 1.559 |
Outliers
Figure 23 Model Gamma Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 17 cases exerting undue influence on the model. To discern the effect of the influential points on the model, a new model (Model Delta) was created without the influential points of this model.
Model Delta
This was also a backward elimination model; however, it was based upon the full model with outliers from Model Gamma removed. The variables were removed as described in Table 17.
Table 17: Model Delta

| Steps | Removed | p-value |
|---|---|---|
| 1 | director_experience | 0.99 |
| 2 | cast_scores_log | 0.83 |
| 3 | director_experience_log | 0.78 |
| 4 | cast_votes_log | 0.49 |
| 5 | best_actress_win | 0.29 |
The model therefore retained the following variables:
Table 18 Model Delta Variables

| Variable | Description |
|---|---|
| genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| runtime_log | Log runtime of movie (in minutes) |
| mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| thtr_rel_month | Month the movie is released in theaters |
| critics_score | Critics score on Rotten Tomatoes |
| best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) |
| best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) |
| best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie |
| best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie |
| cast_votes | Total number of allocated IMDB votes per day for the cast of a film |
| cast_scores | Total number of allocated audience and IMDB scores per day for the cast of a film |
As indicated in Table 19 and graphically depicted in Figure 24, the model was significant (F(34, 432) = 10.953, p < .001), with an adjusted R-squared of 0.414.
Table 19 Model Delta Summary Statistics

| Model | Size | df | df Residuals | F Statistic | RMSE | Residual SE | R-Squared | Adj R-Squared | p-value | % Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Delta | 11 | 34 | 432 | 10.953 | 1.761 | 1.761 | 0.456 | 0.414 | 0 | 45.554 |
Figure 24 Model Delta Regression
Model Diagnostics
Linearity
The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 25.
Figure 25 Model Delta linearity plots
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(34), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 26) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 26 Model Delta homoscedasticity plot
The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.499). As such, the homoscedasticity assumption was met in this case.
Residuals
The histogram and the normal Q-Q plot in Figure 27 illustrate the distribution of residuals.
Figure 27 Model Delta residuals plot
The histogram and normal Q-Q plot did not suggest a normal distribution of residuals. A review of the Shapiro-Wilk test (SW = 0.991, p = 0.006) and the skewness (0.203) and kurtosis (2.506) indicated that normality of residuals was not a reasonable assumption for this model.
Multicollinearity
As shown in Figure 28 and Table 20, collinearity appeared extant for this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF of 4.1 exceeded the threshold of 4. As such, the correlation among the predictors would require further consideration. The multicollinearity assumption was not met for this model.
Figure 28: Correlations among quantitative predictors
| Variable | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| genre | 4.102 | 10 | 1.073 |
| mpaa_rating | 2.655 | 4 | 1.130 |
| thtr_rel_month | 1.848 | 11 | 1.028 |
| best_pic_nom | 1.458 | 1 | 1.207 |
| best_pic_win | 1.366 | 1 | 1.169 |
| best_actor_win | 1.281 | 1 | 1.132 |
| best_dir_win | 1.226 | 1 | 1.107 |
| critics_score | 1.463 | 1 | 1.210 |
| runtime_log | 1.651 | 1 | 1.285 |
| cast_scores | 2.626 | 1 | 1.620 |
| cast_votes | 2.482 | 1 | 1.575 |
Outliers
Figure 29 Model Delta Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 18 cases exerting undue influence on the model. A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points would not be removed from the model.
Model Comparisons
To summarize, models Alpha and Beta were constructed using forward selection and models Gamma and Delta were developed via backward elimination. Models Beta and Delta were fitted without the influential data points from models Alpha and Gamma respectively.
Table 21 Summary of models

| Model | Size | df | df Residuals | F Statistic | RMSE | Residual SE | R-Squared | Adj R-Squared | p-value | % Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Alpha | 8 | 21 | 462 | 15.759 | 1.850 | 1.850 | 0.406 | 0.380 | 0 | 40.554 |
| Model Beta | 8 | 21 | 439 | 17.013 | 1.759 | 1.759 | 0.437 | 0.411 | 0 | 43.664 |
| Model Gamma | 11 | 34 | 449 | 9.866 | 1.853 | 1.853 | 0.420 | 0.378 | 0 | 42.032 |
| Model Delta | 11 | 34 | 432 | 10.953 | 1.761 | 1.761 | 0.456 | 0.414 | 0 | 45.554 |
Forward Selection vs. Backward Elimination
The differences in root mean square error between the forward selection and backward elimination models (0.17% and -0.12%) were not significant. Similarly, the differences in adjusted R-squared (0.55% and 0.72%) were not significant. Lastly, the differences in the percent of variance explained by the models (3.64% and 4.33%) were also not significant.
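The percentages above read as relative differences between paired models (Alpha versus Gamma, Beta versus Delta). The sketch below assumes the difference is expressed relative to the forward selection model, an assumption that reproduces the percent-of-variance figures from the rounded values in Table 21. The RMSE and adjusted R-squared figures differ slightly when recomputed this way, presumably because the table values are rounded.

```python
def pct_diff(base, other):
    """Percent difference of `other` relative to `base`."""
    return (other - base) / base * 100


# Percent of variance explained, from Table 21.
alpha, beta, gamma, delta = 40.554, 43.664, 42.032, 45.554

print(round(pct_diff(alpha, gamma), 2))  # forward vs. backward, full data
print(round(pct_diff(beta, delta), 2))   # forward vs. backward, influential points removed
```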
Influential Points: Drop or Not
The Beta and Delta models were trained on data without the influential points from Alpha and Gamma, respectively. The differences in RMSE (5.21% and 5.25%) were insignificant, as were the differences in adjusted R-squared (8.21% and 9.59%) and the percent of variance explained (7.67% and 8.38%). Moreover, a case-wise review of the influential points did not reveal any data quality issues; therefore, the points would not be removed.
Prediction Accuracy
To evaluate the effects of model selection method and the treatment of outliers, the four multiregression models were evaluated for prediction accuracy on the test data. Four measures of prediction accuracy were used:
- MAPE - Mean Absolute Percentage Error
- MPE - Mean Percentage Error
- MSE - Mean Squared Error
- RMSE - Root Mean Squared Error
In addition, a percent accuracy measure was computed as the percentage of the observations in the test set in which the actual log number of IMDB votes fell within the prediction interval.
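The four measures and the percent accuracy figure can be sketched as follows. The error is taken as actual minus predicted, so a negative MPE corresponds to over-prediction. The function name, the toy values, and the fixed-width intervals are all hypothetical; in practice the interval bounds would come from the fitted model's prediction intervals.

```python
import math


def accuracy_metrics(actual, predicted, lower, upper):
    """MAPE, MPE, MSE, RMSE, and the share of actuals inside [lower, upper]."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    mpe = 100 * sum(e / a for e, a in zip(errors, actual)) / n
    mse = sum(e * e for e in errors) / n
    covered = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper))
    return {
        "MAPE": mape,
        "MPE": mpe,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "% Accuracy": 100 * covered / n,
    }


# Toy values for illustration only (not the report's test data).
actual = [10.0, 8.0, 12.0, 9.0]
predicted = [11.0, 8.0, 13.0, 7.0]
lower = [p - 1.5 for p in predicted]  # hypothetical interval half-width of 1.5
upper = [p + 1.5 for p in predicted]
metrics = accuracy_metrics(actual, predicted, lower, upper)
print(metrics)
```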
Table 22 Model Predictive Accuracy Summary

| Model | Size | F Statistic | R-Squared | Adj R-Squared | % Variance | MAPE | MPE | MSE | RMSE | % Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Alpha | 8 | 15.759 | 0.406 | 0.380 | 40.554 | 11.489 | -2.906 | 3.524 | 1.877 | 95.902 |
| Model Beta | 8 | 17.013 | 0.437 | 0.411 | 43.664 | 11.466 | -2.979 | 3.519 | 1.876 | 94.262 |
| Model Gamma | 11 | 9.866 | 0.420 | 0.378 | 42.032 | 11.365 | -3.315 | 3.461 | 1.860 | 95.902 |
| Model Delta | 11 | 10.953 | 0.456 | 0.414 | 45.554 | 11.185 | -3.278 | 3.419 | 1.849 | 96.721 |
There were no significant differences in MAPE, MSE, and RMSE between the models. The negative MPE indicated that all models were biased toward over-prediction. From a percent accuracy perspective, the forward selection and backward elimination models performed nearly identically with and without the influential points. Among the models that retained the influential points, Alpha and Gamma had the same percent accuracy (95.90%). The Alpha model would advance to the prediction stage.
Model Two: Final Multiregression Model
The final prediction equation was defined as follows:
\(y_i = 8.355864 + 0.001x_1 - 0.743x_2 - 2.541x_3 - 1.329x_4 - 4.065x_5 - 2.028x_6 - 0.681x_7 - 3.629x_8 - 1.454x_9 - 1.818x_{10} - 0.528x_{11} + 0.017x_{12} - 0.108x_{13} + 0.751x_{14} + 0.363x_{15} - 0.847x_{16} + 0x_{17} + 1.052x_{18} + 0.765x_{19} + 0.524x_{20} + \epsilon\)
where:
\(x_1\) is cast_scores
\(x_2\) is genreAnimation
\(x_3\) is genreArt House & International
\(x_4\) is genreComedy
\(x_5\) is genreDocumentary
\(x_6\) is genreDrama
\(x_7\) is genreHorror
\(x_8\) is genreMusical & Performing Arts
\(x_9\) is genreMystery & Suspense
\(x_{10}\) is genreOther
\(x_{11}\) is genreScience Fiction & Fantasy
\(x_{12}\) is critics_score
\(x_{13}\) is mpaa_ratingPG
\(x_{14}\) is mpaa_ratingPG-13
\(x_{15}\) is mpaa_ratingR
\(x_{16}\) is mpaa_ratingUnrated
\(x_{17}\) is cast_votes
\(x_{18}\) is best_pic_nomyes
\(x_{19}\) is runtime_log
\(x_{20}\) is best_dir_winyes
The genre and MPAA rating variables were coded 0 or 1 in accordance with the genre and MPAA rating of each observation.
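The 0/1 coding can be illustrated with the MPAA rating terms of the prediction equation. The sketch assumes G is the reference level, since it has no term of its own in the equation; the helper name is hypothetical, and the coefficients are copied from the equation above.

```python
def dummy_code(value, levels, prefix):
    """Return 0/1 indicators for every level except the first (reference) level."""
    return {f"{prefix}{level}": int(value == level) for level in levels[1:]}


MPAA_LEVELS = ["G", "PG", "PG-13", "R", "Unrated"]  # G assumed to be the reference
MPAA_COEFS = {
    "mpaa_ratingPG": -0.108,
    "mpaa_ratingPG-13": 0.751,
    "mpaa_ratingR": 0.363,
    "mpaa_ratingUnrated": -0.847,
}

coded = dummy_code("PG-13", MPAA_LEVELS, "mpaa_rating")
# Contribution of the MPAA rating to the prediction for a PG-13 film.
contribution = sum(MPAA_COEFS[name] * flag for name, flag in coded.items())
print(coded, contribution)
```

A film at the reference level has all indicators equal to 0, so its rating contributes nothing beyond the intercept.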
Analysis of Variance
Figure 30 summarizes the analysis of variance.

| Term | Df | Sum Sq | Mean Sq | F Statistic | Pr(>F) | % Var |
|---|---|---|---|---|---|---|
| cast_scores | 1 | 496.294 | 496.294 | 144.988 | 0.000 | 18.66 |
| genre | 10 | 365.435 | 36.544 | 10.676 | 0.000 | 13.74 |
| critics_score | 1 | 80.027 | 80.027 | 23.379 | 0.000 | 3.01 |
| mpaa_rating | 4 | 64.886 | 16.222 | 4.739 | 0.001 | 2.44 |
| cast_votes | 1 | 27.065 | 27.065 | 7.907 | 0.005 | 1.02 |
| best_pic_nom | 1 | 22.831 | 22.831 | 6.670 | 0.010 | 0.86 |
| runtime_log | 1 | 14.223 | 14.223 | 4.155 | 0.042 | 0.53 |
| best_dir_win | 1 | 8.099 | 8.099 | 2.366 | 0.125 | 0.30 |
| Residuals | 462 | 1581.424 | 3.423 | NA | NA | 59.45 |
Figure 30 Model Alpha analysis of variance
An analysis of variance was conducted on the influence of 8 independent variables on the log number of IMDB votes. The effect of cast_scores was significant, F(1, 462) = 144.988, p < .001, accounting for 18.66% of the variance. The effect of genre was significant, F(10, 462) = 10.676, p < .001, accounting for 13.74% of the variance. The effect of critics_score was significant, F(1, 462) = 23.379, p < .001, accounting for 3.01% of the variance. The effect of mpaa_rating was significant, F(4, 462) = 4.739, p = .001, accounting for 2.44% of the variance. The effect of cast_votes was significant, F(1, 462) = 7.907, p < .01, accounting for 1.02% of the variance. The effect of best_pic_nom was significant, F(1, 462) = 6.670, p < .05, accounting for 0.86% of the variance. The effect of runtime_log was significant, F(1, 462) = 4.155, p < .05, accounting for 0.53% of the variance. The effect of best_dir_win was not significant at the .05 level, F(1, 462) = 2.366, p = .125, accounting for 0.30% of the variance. Finally, residuals represented 59.45% of the variance. The model was significant (F(20, 462) = 15.759, p < .001), with an adjusted R-squared of 0.38.
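The percentage of variance, the overall F statistic and the adjusted R-squared can be reproduced from the Df and Sum Sq columns of Figure 30. The following Python sketch does that arithmetic; the variable names are illustrative, and Python is a choice made for this example rather than the tool used for the original analysis.

```python
# (degrees of freedom, sum of squares) for each term, copied from Figure 30.
anova = {
    "cast_scores": (1, 496.294),
    "genre": (10, 365.435),
    "critics_score": (1, 80.027),
    "mpaa_rating": (4, 64.886),
    "cast_votes": (1, 27.065),
    "best_pic_nom": (1, 22.831),
    "runtime_log": (1, 14.223),
    "best_dir_win": (1, 8.099),
}
residual_df, residual_ss = 462, 1581.424

model_df = sum(df for df, _ in anova.values())
model_ss = sum(ss for _, ss in anova.values())
total_ss = model_ss + residual_ss

# Share of total variation attributed to each term, in percent.
pct_var = {term: 100.0 * ss / total_ss for term, (_, ss) in anova.items()}

# Overall F: model mean square divided by residual mean square.
f_stat = (model_ss / model_df) / (residual_ss / residual_df)

# R-squared and its adjustment for the number of model terms.
r_squared = model_ss / total_ss
adj_r_squared = 1 - (1 - r_squared) * (model_df + residual_df) / residual_df
```

The term degrees of freedom sum to 20, and the ratio of mean squares evaluates to approximately 15.759, in agreement with the reported overall F statistic.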
Interpretation of Coefficients
Although there are only 8 variables, there are 21 coefficients (20 slopes plus the intercept), a consequence of the number of levels in the categorical variables. The coefficient estimates are listed in Table 23.

Table 23: Model Alpha Coefficients

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 8.356 | 2.816 | 2.967 | 0.003 |
| cast_scores | 0.001 | 0.000 | 3.427 | 0.001 |
| genreAnimation | -0.743 | 0.856 | -0.869 | 0.386 |
| genreArt House & International | -2.541 | 0.638 | -3.981 | 0.000 |
| genreComedy | -1.329 | 0.346 | -3.845 | 0.000 |
| genreDocumentary | -4.065 | 0.498 | -8.162 | 0.000 |
| genreDrama | -2.028 | 0.296 | -6.848 | 0.000 |
| genreHorror | -0.681 | 0.568 | -1.200 | 0.231 |
| genreMusical & Performing Arts | -3.629 | 0.762 | -4.763 | 0.000 |
| genreMystery & Suspense | -1.454 | 0.380 | -3.829 | 0.000 |
| genreOther | -1.818 | 0.604 | -3.008 | 0.003 |
| genreScience Fiction & Fantasy | -0.528 | 0.707 | -0.746 | 0.456 |
| critics_score | 0.017 | 0.004 | 4.900 | 0.000 |
| mpaa_ratingPG | -0.108 | 0.609 | -0.177 | 0.860 |
| mpaa_ratingPG-13 | 0.751 | 0.625 | 1.202 | 0.230 |
| mpaa_ratingR | 0.363 | 0.602 | 0.602 | 0.547 |
| mpaa_ratingUnrated | -0.847 | 0.717 | -1.182 | 0.238 |
| cast_votes | 0.000 | 0.000 | 2.698 | 0.007 |
| best_pic_nomyes | 1.052 | 0.462 | 2.279 | 0.023 |
| runtime_log | 0.765 | 0.427 | 1.793 | 0.074 |
| best_dir_winyes | 0.524 | 0.341 | 1.538 | 0.125 |
The intercept estimate, 8.356, is the regression estimate for the mean log number of IMDB votes for an action and adventure film at the reference MPAA rating, with no Oscar wins or nominations and zeros for all of the other variables. The other coefficient estimates adjust this estimate accordingly. A prediction for the log number of IMDB votes is therefore equal to:

* the intercept value, 8.356,
* plus 0.001 log IMDB votes for each composite score point earned by the cast members,
* plus a number of log IMDB votes associated with the genre of the film,
* plus 0.017 log IMDB votes for each point of the Rotten Tomatoes critics score,
* plus a number of log IMDB votes associated with the MPAA rating,
* plus approximately 0 log IMDB votes (0.000 to three decimal places) for each vote previously earned by the cast members,
* plus 1.052 log IMDB votes if the film was nominated for an Oscar for best picture,
* plus 0.765 log IMDB votes for each log minute of runtime,
* plus 0.524 log IMDB votes if the film's director won an Oscar (best_dir_win).
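The additive reading of the coefficients can be made concrete with a small worked example. The Python sketch below uses the rounded estimates from Table 23, so its output is only approximate; in particular, the cast_votes coefficient is 0.000 at three decimals and contributes nothing here. The function name, the input values, and the use of the natural logarithm for runtime_log are assumptions made for illustration.

```python
import math

INTERCEPT = 8.356
GENRE = {  # Action & Adventure is the reference genre (adjustment 0)
    "Animation": -0.743, "Art House & International": -2.541,
    "Comedy": -1.329, "Documentary": -4.065, "Drama": -2.028,
    "Horror": -0.681, "Musical & Performing Arts": -3.629,
    "Mystery & Suspense": -1.454, "Other": -1.818,
    "Science Fiction & Fantasy": -0.528,
}
MPAA = {"PG": -0.108, "PG-13": 0.751, "R": 0.363, "Unrated": -0.847}

def predict_log_votes(cast_scores, genre, critics_score, mpaa_rating,
                      cast_votes, best_pic_nom, runtime_minutes, best_dir_win):
    """Approximate predicted log IMDB votes from the rounded Table 23 estimates."""
    return (
        INTERCEPT
        + 0.001 * cast_scores
        + GENRE.get(genre, 0.0)       # levels without a coefficient act as reference
        + 0.017 * critics_score
        + MPAA.get(mpaa_rating, 0.0)  # levels without a coefficient act as reference
        + 0.000 * cast_votes          # rounded estimate; true value is small but nonzero
        + 1.052 * int(best_pic_nom)
        + 0.765 * math.log(runtime_minutes)  # natural log assumed
        + 0.524 * int(best_dir_win)
    )

# Hypothetical film: an R-rated drama, 120 minutes, critics score of 80.
example = predict_log_votes(
    cast_scores=500, genre="Drama", critics_score=80, mpaa_rating="R",
    cast_votes=100000, best_pic_nom=False, runtime_minutes=120, best_dir_win=False,
)
```

A prediction intended for reporting would use the unrounded coefficients from the fitted model rather than the three-decimal values shown in the table.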
Model Summary
The purpose of this section was to develop a model that would be able to predict “box office success”. Given the significant right skew in box office revenue, the log of box office revenue became the proxy for box office success. Two regression models were therefore fit in this section. Model One, the simple linear regression model (F(2, 177) = 332.24, p < .001), showed that the log number of IMDB votes was the best predictor of the log of box office revenue. With log IMDB votes designated as the response variable, Model Two (F(20, 462) = 15.76, p < .001) was selected from among four multiregression models built with forward selection and backward elimination algorithms. Next, the models will be used to predict the log number of IMDB votes and the log box office revenue for a randomly selected film.